Assignment on RPubs
Rmd on GitHub
These are the libraries we will be using.
library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(stringr)
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(tidytext)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v tibble 2.1.3 v purrr 0.3.3
## v tidyr 1.0.2 v dplyr 0.8.5
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
We discussed which data sets to use. The internet offers a plethora of data, but few data sets were of practical use for this project. LinkedIn was not readily accessible.
The data we ended up using comes from https://data.world/jobspikr/10000-data-scientist-job-postings-from-the-usa. Phil was able to obtain the CSV, which was scraped using JobsPikr. The CSV consists of 10,000 records with the following fields: crawl_timestamp, url, job_title, category, company_name, city, state, country, inferred_city, inferred_state, inferred_country, post_date, job_description, job_type, salary_offered, job_board, geo, cursor, contact_email, contact_phone_number, uniq_id, and html_job_description.
This is the original CSV with its field names.
CSV Fieldnames
url <- "https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz"
jobs <- read_csv(url)
## Parsed with column specification:
## cols(
## .default = col_character(),
## post_date = col_date(format = ""),
## salary_offered = col_logical(),
## cursor = col_double(),
## contact_email = col_logical(),
## html_job_description = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 3 parsing failures.
## row col expected actual file
## 2070 salary_offered 1/0/T/F/TRUE/FALSE Salary Range: Â Â Â Undisclosed 'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
## 2499 job_description closing quote at end of file 'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
## 2499 NA 22 columns 13 columns 'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
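The first parsing failure above arises because `salary_offered` was read as a logical column, so a text value in row 2070 could not be parsed. The sketch below reproduces that situation on a tiny made-up file (the rows and the `tmp` file are hypothetical, not taken from the real data) and shows one possible remedy, reading every column as character. It is an illustration under those assumptions, not necessarily how the rest of this analysis handles the issue.

```r
library(readr)

# Hypothetical miniature of the problem: salary_offered is mostly empty,
# but one row holds free text. Written to a temporary file for the demo.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "job_title,salary_offered",
  "Data Scientist,",
  "Data Analyst,",
  "ML Engineer,Salary Range: Undisclosed"
), tmp)

# Mimic the column specification shown above: salary_offered as logical.
# The text value cannot be parsed, so it becomes NA and a warning is raised.
as_logical <- suppressWarnings(
  read_csv(tmp, col_types = cols(salary_offered = col_logical()))
)

# problems() lists the rows and columns that failed to parse.
problems(as_logical)

# Possible remedy: read every column as character and convert later.
as_text <- read_csv(tmp, col_types = cols(.default = col_character()))
as_text$salary_offered
```

For the real file, the same `col_types` argument could be passed to `read_csv(url, ...)`. Note that this only addresses the type mismatch; the other two failures at row 2499 (a missing closing quote and 13 columns instead of 22) point to a malformed record, which column types alone will not repair.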
jobs